Automatic Phonemic Labeling and Segmentation of Spoken Dutch

نویسندگان

  • Kris Demuynck
  • Tom Laureys
  • Patrick Wambacq
  • Dirk Van Compernolle
چکیده

The CGN corpus (Oostdijk, 2000) (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of the phonemic annotations and the corresponding segmentations. First, we detail the processes used to generate possible pronunciations for each sentence and to select to most likely one. Next, we identify the remaining difficulties when handling the CGN data and explain how we solved them. We conclude with an evaluation of the quality of the resulting transcriptions and segmentations.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A statistical phonemic segment model for speech recognition based on automatic phonemic segmentation

This paper presents a method of constructing a statistical phonemic segment model (SPSM) for a speech recognition system based on speaker-independent context-independent automatic phonemic segmentation. In our recent research, we proposed the phoneme recognition system using the template matching method with the same segmentation, and confirmed that 5-frame-fixed time sequence of feature vector...

متن کامل

BALDEY: A database of auditory lexical decisions.

In an auditory lexical decision experiment, 5541 spoken content words and pseudowords were presented to 20 native speakers of Dutch. The words vary in phonological make-up and in number of syllables and stress pattern, and are further representative of the native Dutch vocabulary in that most are morphologically complex, comprising two stems or one stem plus derivational and inflectional suffix...

متن کامل

Assessing Segmentations: Two Methods for Confidence Scoring Automatic HMM-Based Word Segmentations

The Dutch-Flemish project Spoken Dutch Corpus (1998-2003) aims at the development of an annotated corpus of 10 million spoken words. In order to make the speech data easily accessible, a word segmentation couples the orthographic transcription to the speech signal by means of time stamps. Generally, such segmentations are produced manually. Since this manual procedure is a time-consuming effort...

متن کامل

Word Segmentation in the Spoken Dutch Corpus

ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium martens,odul,rvparijs @elis.rug.ac.be Dept Language & Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands [email protected] ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium kris.demuynck,tom.laureys,jacques.duchateau @esat.kuleuven.ac.be Abstract This paper describe...

متن کامل

The AUTONOMATA Spoken Names Corpus

In the Autonomata project we have collected a corpus of spoken name utterances with manually corrected phonemic transcriptions of these utterances. The corpus was designed with the intention to become a major resource for the development of automatic speech recognition engines that can achieve a high accuracy on the recognition of person and geographical names spoken in Dutch. The recorded name...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004